A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)
Abstract
The concept of overfitting in model selection is explained and demonstrated. After providing some background information on information theory and Kolmogorov complexity, we give a short explanation of Minimum Description Length and error minimization. We conclude with a discussion of the typical features of overfitting in model selection.

∗ The Paradox of Overfitting [Nan03], Chapter 1.

1 The paradox of overfitting

Machine learning is the branch of Artificial Intelligence that deals with learning algorithms. Learning is a figurative description of what in ordinary science is also known as model selection and generalization. In computer science a model is a set of binary encoded values or strings, often the parameters of a function or statistical distribution. Models that parameterize the same function or distribution are called a family. Models of the same family are usually indexed by the number of parameters involved. This number of parameters is also called the degree or the dimension of the model.

To learn some real-world phenomenon means to take some examples of the phenomenon and to select a model that describes them well. When such a model can also be used to describe instances of the same phenomenon that it was not trained on, we say that it generalizes well or that it has a small generalization error. The task of a learning algorithm is to minimize this generalization error.

Classical learning algorithms did not allow for logical dependencies [MP69] and were of limited interest to Artificial Intelligence. The advance of techniques like neural networks with back-propagation in the 1980s and Bayesian networks in the 1990s changed this profoundly. With such techniques it is possible to learn very complex relations. Learning algorithms are now extensively used in applications like expert systems, computer vision and language recognition. Machine learning has earned itself a central position in Artificial Intelligence.

A serious problem with most of the common learning algorithms is overfitting. Overfitting occurs when the models describe the examples better and better but get worse and worse on other instances of the same phenomenon. This can make the whole learning process worthless.

A good way to observe overfitting is to split a number of examples in two, a training set and a test set, and to train the models on the training set. Clearly, the higher the degree of the model, the more information the model will contain about the training set. But when we look at the generalization error of the models on the test set, we will usually see that after an initial phase of improvement the generalization error suddenly becomes catastrophically bad. To the uninitiated student this takes some effort to accept, since it apparently contradicts the basic empirical truth that more information will not lead to worse predictions. We may well call this the paradox of overfitting.

It might seem at first that overfitting is a problem specific to machine learning, with its use of very complex models. And since some model families suffer less from overfitting than others, the ultimate answer might seem to be a model family that is entirely free from overfitting. But overfitting is a very general problem that has been known in statistics for a long time. And since overfitting is not the only constraint on models, it will not be solved by searching for model families that are entirely free of it.
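The train-and-test procedure just described is easy to simulate. The following Python sketch is an illustration added here, not part of the original text; the true function, the noise level and the sample sizes are arbitrary assumptions. It fits polynomials of increasing degree to a noisy training set and measures the squared error on a held-out test set.

import numpy as np

# Illustration only: observing overfitting with a training set and a test set.
rng = np.random.default_rng(0)

def phenomenon(x):
    # A smooth stand-in for the "real world phenomenon" to be learned.
    return np.sin(2 * np.pi * x)

# Draw examples and split them in two: a training set and a test set.
x = rng.uniform(0.0, 1.0, 60)
y = phenomenon(x) + rng.normal(scale=0.1, size=x.size)
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

for degree in (1, 3, 6, 12, 20):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit the degree-n model
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.5f}, test error {test_err:.5f}")

On a typical run the training error keeps shrinking as the degree grows, while the test error improves at first and then deteriorates sharply, which is the overfitting pattern discussed above.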
Many families of models are essential to their field because of their speed, accuracy, ease of mathematical treatment, and other properties that are unlikely to be matched by an equivalent family that is free from overfitting. As an example, polynomials are used widely throughout all of science because of their many algorithmic advantages, yet they suffer very badly from overfitting. ARMA models are essential to signal processing and are often used to model time series; they also suffer badly from overfitting. If we want to use the model with the best algorithmic properties for our application, we need a theory that can select the best model from any arbitrary family.

2 An example of overfitting

Figure 1 gives a good example of overfitting. The upper graph shows two curves in the two-dimensional plane. One of the curves is a segment of the Lorenz attractor, the other a 43-degree polynomial. A Lorenz attractor is a complicated self-similar object. Here it is only important because it is definitely not a polynomial and because its curve is relatively smooth. Such a curve can be approximated well by a polynomial. An n-degree polynomial is a function of the form

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n, \qquad x \in \mathbb{R} \qquad (1)$$

with an (n+1)-dimensional parameter space $(a_0, \ldots, a_n) \in \mathbb{R}^{n+1}$. A polynomial is very easy to work with, and polynomials are used throughout science to model (or approximate) other functions. If the other function has to be inferred from a sample of points that witness that function, the problem is called a regression problem. Based on a small training sample that witnesses our Lorenz attractor, we search for a polynomial that optimally predicts future points that follow the same curve.
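To make equation (1) concrete, the sketch below (again an added illustration; the sample curve is an arbitrary smooth stand-in for the Lorenz segment, not the data of Figure 1) builds the Vandermonde design matrix of an n-degree polynomial and solves the regression problem for its n+1 parameters by least squares.

import numpy as np

def fit_polynomial(x, y, n):
    """Return the parameters (a_0, ..., a_n) of the least-squares polynomial fit."""
    # Design matrix with columns x^0, x^1, ..., x^n (a Vandermonde matrix).
    design = np.vander(x, n + 1, increasing=True)
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params

def evaluate(params, x):
    """Evaluate f(x) = a_0 + a_1 x + ... + a_n x^n."""
    return sum(a * x ** k for k, a in enumerate(params))

# Points witnessing a smooth curve (arbitrary stand-in for the Lorenz segment).
x = np.linspace(0.0, 1.0, 50)
y = np.exp(-x) * np.cos(6.0 * x)

a = fit_polynomial(x, y, n=5)
print("parameters a_0 .. a_5:", np.round(a, 3))
print("largest residual on the sample:", np.max(np.abs(evaluate(a, x) - y)))

Raising n makes the fit to the sample points ever tighter; the model selection question is how large n may become before the fit stops generalizing.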
Similar resources
Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity
The relationship between the Bayesian approach and the minimum description length approach is established. We sharpen and clarify the general modeling principles minimum description length (MDL) and minimum message length (MML), abstracted as the ideal MDL principle and defined from Bayes’s rule by means of Kolmogorov complexity. The basic condition under which the ideal principle should be app...
Computing Minimum Description Length for Robust Linear Regression Model Selection
A minimum description length (MDL) and stochastic complexity approach for model selection in robust linear regression is studied in this paper. Computational aspects and implementation of this approach to practical problems are the focuses of the study. Particularly, we provide both algorithms and a package of S language programs for computing the stochastic complexity and proceeding with the a...
A New Minimum Description Length
The minimum description length (MDL) method is one of the pioneering methods of parametric order estimation with a wide range of applications. We investigate the definition of two-stage MDL for parametric linear model sets and exhibit some drawbacks of the theory behind the existing MDL. We introduce a new description length which is inspired by the Kolmogorov complexity principle.
Stochastic complexity and model selection from incomplete data
The principle of minimum description length (MDL) provides an approach for selecting the model class with the smallest stochastic complexity of the data among a set of model classes. However, when only incomplete data are available the stochastic complexity for the complete data cannot be numerically computed. In this paper, this problem is solved by introducing a notion of expected stochastic ...
Complexity Approximation Principle
We propose a new inductive principle, which we call the complexity approximation principle (CAP). This principle is a natural generalization of Rissanen’s minimum description length (MDL) principle and Wallace’s minimum message length (MML) principle and is based on the notion of predictive complexity, a recent generalization of Kolmogorov complexity. Like the MDL principle, CAP can be regarded...
The MDL model choice for linear regression
In this talk, we discuss the principle of Minimum Description Length (MDL) for problems of statistical modeling. By viewing models as a means of providing statistical descriptions of observed data, the comparison between competing models is based on the stochastic complexity (SC) of each description. The Normalized Maximum Likelihood (NML) form of the SC (Rissanen 1996) contains a component tha...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Journal: CoRR
Volume: abs/1005.2364
Publication date: 2003